NSF PAR Search | NSF Public Access Repository


SSD failures in the field: symptoms, causes, and prediction models

https://doi.org/10.1145/3295500.3356172

Alter, Jacob; Xue, Ji; Dimnaku, Alma; Smirni, Evgenia (November 2019, Supercomputing 2019)

Full Text Available
Fill-in the gaps: Spatial-temporal models for missing data

https://doi.org/10.23919/CNSM.2017.8255983

Xue, Ji; Nie, Bin; Smirni, Evgenia (November 2017, 13th International Conference on Network and Service Management, CNSM 2017)

Effective workload characterization and prediction are instrumental for efficiently and proactively managing large systems. System management primarily relies on the workload information provided by underlying system tracing mechanisms that record system-related events in log files. However, such tracing mechanisms may temporarily fail for various reasons, yielding “holes” in data traces. This missing data phenomenon significantly impedes the effectiveness of data analysis. In this paper, we study real-world data traces collected from over 80K virtual machines (VMs) hosted on 6K physical boxes in the data centers of a service provider. We discover that the usage series of VMs co-located on the same physical box exhibit strong correlation with one another, and that most VM usage series show temporal patterns. By taking advantage of the observed spatial and temporal dependencies, we propose a data-filling method to predict the missing data in the VM usage series. Detailed evaluation using trace data in the wild shows that the proposed method is sufficiently accurate, achieving an average absolute percentage error of 20%. We also illustrate its usefulness via a use case.
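The idea of filling gaps from spatial and temporal dependencies can be illustrated with a minimal Python sketch. This is not the paper's method: the function names (`fill_missing`, `mean_ape`), the choice of a least-squares line against the single most correlated co-located series, the same-phase seasonal average, and the equal-weight blend of the two estimates are all assumptions made for illustration.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def fill_missing(target, neighbors, period):
    """Return a copy of `target` with None entries replaced by estimates.

    target:    usage series of one VM; None marks a hole (hypothetical format)
    neighbors: fully observed series of VMs on the same physical box
    period:    assumed seasonal period, in samples
    """
    observed = [i for i, v in enumerate(target) if v is not None]
    if not observed:
        raise ValueError("target has no observed samples")
    ys = [target[i] for i in observed]

    # Spatial part: pick the neighbor most correlated with the target.
    best, best_r = None, 0.0
    for nb in neighbors:
        r = pearson(ys, [nb[i] for i in observed])
        if abs(r) > abs(best_r):
            best, best_r = nb, r

    # Fit target ~ slope * neighbor + intercept on the observed samples.
    if best is not None:
        xs = [best[i] for i in observed]
        mx, my = mean(xs), mean(ys)
        var_x = sum((x - mx) ** 2 for x in xs)
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var_x
        intercept = my - slope * mx

    filled = list(target)
    for i, v in enumerate(target):
        if v is not None:
            continue
        estimates = []
        if best is not None:
            estimates.append(slope * best[i] + intercept)
        # Temporal part: average of observed samples at the same phase.
        same_phase = [target[j] for j in observed if j % period == i % period]
        if same_phase:
            estimates.append(mean(same_phase))
        # Fall back to the overall mean when neither estimate is available.
        filled[i] = mean(estimates) if estimates else mean(ys)
    return filled


def mean_ape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return mean(100 * abs(p - a) / abs(a) for a, p in zip(actual, predicted))
```

With held-out samples, `mean_ape` gives the kind of average absolute percentage error the abstract reports; the 20% figure itself comes from the paper's own evaluation, not from this sketch.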
Full Text Available
Spatial–Temporal Prediction Models for Active Ticket Managing in Data Centers

https://doi.org/10.1109/TNSM.2018.2794409

Xue, Ji; Birke, Robert; Chen, Lydia Y.; Smirni, Evgenia (March 2018, IEEE Transactions on Network and Service Management)

Full Text Available
Machine Learning Models for GPU Error Prediction in a Large Scale HPC System

https://doi.org/10.1109/DSN.2018.00022

Nie, Bin; Xue, Ji; Gupta, Saurabh; Patel, Tirthak; Engelmann, Christian; Smirni, Evgenia; Tiwari, Devesh (June 2018, 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN))

GPUs are widely deployed on large-scale HPC systems to provide powerful computational capability for scientific applications from various domains. As those applications are normally long-running, investigating the characteristics of GPU errors becomes imperative for reliability. In this paper, we first study the system conditions that trigger GPU errors using six-month trace data collected from a large-scale, operational HPC system. Then, we use machine learning to predict the occurrence of GPU errors, by taking advantage of temporal and spatial dependencies of the trace data. The resulting machine learning prediction framework is robust and accurate under different workloads.
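One way to expose temporal and spatial dependencies to a learning algorithm is to turn an error log into per-node features. The Python sketch below is only an illustration under stated assumptions: the input layout (a 0/1 error flag per node and time step), the `neighbors` map, the sliding `window`, and the function name `build_features` are hypothetical and are not taken from the paper.

```python
def build_features(errors, neighbors, window):
    """Build (node, t, temporal, spatial, label) rows from an error log.

    errors:    dict mapping node id -> list of 0/1 error flags per time step
    neighbors: dict mapping node id -> list of nearby node ids (assumed layout)
    window:    number of preceding time steps to summarize
    """
    rows = []
    for node, series in errors.items():
        for t in range(window, len(series)):
            # Temporal feature: errors on this node in the preceding window.
            temporal = sum(series[t - window:t])
            # Spatial feature: errors on neighboring nodes in the same window.
            spatial = sum(
                sum(errors[nb][t - window:t]) for nb in neighbors.get(node, [])
            )
            # Label: whether this node reports an error at time t.
            rows.append((node, t, temporal, spatial, series[t]))
    return rows
```

The resulting rows could be fed to any classifier; which features and models work well on real traces is what the paper evaluates.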
Full Text Available
Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities

https://doi.org/10.1109/MASCOTS.2017.12

Nie, Bin; Xue, Ji; Gupta, Saurabh; Engelmann, Christian; Smirni, Evgenia; Tiwari, Devesh (September 2017, 2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS))

GPUs have become part of the mainstream high performance computing facilities that increasingly require more computational power to simulate physical phenomena quickly and accurately. However, GPU nodes also consume significantly more power than traditional CPU nodes, and high power consumption introduces new system operation challenges, including increased temperature, higher power and cooling costs, and lower system reliability. This paper explores how power consumption and temperature characteristics affect reliability, provides insights into the implications of this understanding, and shows how to exploit these insights to predict GPU errors using neural networks.
Full Text Available
